import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm import tqdm
from collections import defaultdict
from surprise import (Reader, Dataset, SVD, SVDpp, KNNBaseline, KNNWithMeans,
KNNWithZScore)
from surprise.model_selection import cross_validate
from IPython.display import HTML, display
import pprint
!pip install dataprep
from dataprep.eda import plot
pp = pprint.PrettyPrinter(indent=4, width=100)
HTML('''
<style>
.output_png {
display: table-cell;
text-align: center;
vertical-align: middle;
}
</style>
<script>
code_show=true;
function code_toggle() {
if (code_show){
$('div.input').hide();
} else {
$('div.input').show();
}
code_show = !code_show
}
$( document ).ready(code_toggle);
</script>
<form action="javascript:code_toggle()"><input type="submit"
value="Click here to toggle on/off the raw code."></form>
''')
User-based collaborative filtering is a recommendation technique that suggests items to a target user based on the preferences of similar users. This technique was applied to the sushi dataset to generate recommendations. The dataset consists of each user's top-5 sushi ranking preferences, and these were used as the basis for the recommendations. The Surprise library was used, with the KNNBaseline algorithm yielding the lowest RMSE of 1.141373. This shows that, aside from explicit ratings, ranking preferences can serve as a basis for building user-based collaborative recommender systems.
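As a concrete illustration of the idea, here is a minimal NumPy sketch of user-based collaborative filtering on a toy rating matrix. The data and the `predict` helper are purely illustrative; the project itself uses the Surprise library on the sushi dataset.

```python
import numpy as np

# Toy user-item rating matrix (rows = users, columns = items); NaN = unrated.
# Illustrative data only -- not the sushi dataset.
R = np.array([
    [4.0, 3.0, np.nan, 1.0],
    [4.0, np.nan, 2.0, 1.0],
    [1.0, 1.0, np.nan, 4.0],
])


def predict(R, user, item, k=2):
    '''Predict R[user, item] from the k most similar users who rated it.'''
    filled = np.nan_to_num(R)                    # treat missing entries as 0
    norms = np.linalg.norm(filled, axis=1)
    # Cosine similarity between the target user and every user.
    sims = filled @ filled[user] / (norms * norms[user] + 1e-12)
    raters = [u for u in range(R.shape[0])
              if u != user and not np.isnan(R[u, item])]
    top = sorted(raters, key=lambda u: sims[u], reverse=True)[:k]
    weights = sims[top]
    # Similarity-weighted average of the neighbors' ratings.
    return float(weights @ R[top, item] / weights.sum())


# User 0 has not rated item 2; the prediction borrows from similar users.
print(predict(R, user=0, item=2))   # -> 2.0 (only user 1 rated item 2)
```

Surprise's KNN algorithms follow the same neighborhood logic, with baselines and mean-centering added on top.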
A recommender system suggests items that are expected to be interesting or preferable to specific users, based on the items they have rated. Recommendation systems play a very important role in today's business landscape, especially given the sheer volume of transactions now taking place. For this project, a recommendation technique known as user-based collaborative filtering is employed.
In user-based recommendation systems, users similar to the target user who needs a recommendation are identified first. This approach is prominent in several applications, such as Netflix (for show recommendations) and Zomato (for food recommendations).
Most, if not all, recommendation systems employ the semantic differential method to measure a user's preference. In this method, users identify which items they like through a rating scale, similar to the one below:
highly preferred [5 4 3 2 1] not preferred
For this project, we will try to see whether this ranking method is effective in generating recommendations beyond the items a user has already rated.
In building recommendation systems, many kinds of input can serve as a basis for recommendation. One option is a binary preference system, wherein users simply put 1 if they like an item and 0 otherwise. Using ranking data on items a user has seen may be a better way to generate recommendations from completely unseen items, with a good chance the recommendation will be liked. The caveat is that there is no way to verify success without following up with the user (Do you like the item recommended to you?). Nonetheless, recommendations produced this way can serve as a good baseline.
This project then seeks to answer the following question: can a recommendation system be built from users' ranking preference data?
Interestingly, recommendation systems can also surface insights about a culture. Such is the case with analyzing sushi.
Sushi originated in Japan, where it is a well-known food. Those of us outside Japan encounter only a few types of sushi, but in Japan there are many types that differ by location and prefecture.
Each type of sushi carries a unique ingredient. Listed below are the sushi types used in this study.
| index | Sushi Name | Ingredients |
|---|---|---|
| 0 | ebi | (shrimp) |
| 1 | anago | (sea eel) |
| 2 | maguro | (tuna) |
| 3 | ika | (squid) |
| 4 | uni | (sea urchin) |
| 5 | tako | (octopus) |
| 6 | ikura | (salmon roe) |
| 7 | tamago | (egg) |
| 8 | toro | (fatty tuna) |
| 9 | amaebi | (AMA shrimp) |
| 10 | hotategai | (scallop) |
| 11 | tai | (sea bream) |
| 12 | akagai | (ark shell) |
| 13 | hamachi | (young yellowtail) |
| 14 | awabi | (abalone) |
| 15 | samon | (salmon) |
| 16 | kazunoko | (herring roe) |
| 17 | shako | (squilla) |
| 18 | saba | (mackerel) |
| 19 | chu_toro | (mildly_fatty tuna) |
| 20 | hirame | (flatfish) |
| 21 | aji | (horse mackerel) |
| 22 | kani | (crab) |
| 23 | kohada | (medium_sized KONOSHIRO gizzard shad) |
| 24 | torigai | (TORI_clam) |
| 25 | unagi | (eel) |
| 26 | tekka_maki | (tuna roll) |
| 27 | kanpachi | (amberjack) |
| 28 | mirugai | (MIRU_clam) |
| 29 | kappa_maki | (cucumber roll) |
| 30 | geso | (squid feet) |
| 31 | katsuo | (oceanic bonito) |
| 32 | iwashi | (sardine) |
| 33 | hokkigai | (HOKKI-clam) |
| 34 | shimaaji | (hardtail) |
| 35 | kanimiso | (crab liver) |
| 36 | engawa | (flesh from around the base of the dorsal and ventral fins of a flounder or flatfish) |
| 37 | negi_toro | (fatty flesh of tuna minced to a paste and mixed with chopped green leaves of Welsh onions) |
| 38 | nattou_maki | (fermented bean roll) |
| 39 | sayori | (halfbeak) |
| 40 | takuwan_maki | (DAIKON pickles roll) |
| 41 | botanebi | (BOTAN shrimp) |
| 42 | tobiko | (flying fish roe) |
| 43 | inari | (fried tofu wrapper; http://en.wikipedia.org/wiki/Sushi) |
| 44 | mentaiko | (chili cod roe) |
| 45 | sarada | (salad) |
| 46 | suzuki | (sea bass) |
| 47 | tarabagani | (king crab) |
| 48 | ume_shiso_maki | (pickled plum & perilla leaf roll) |
| 49 | komochi_konbu | (herring roe & sea tangle) |
| 50 | tarako | (cod roe) |
| 51 | sazae | (turban shell) |
| 52 | aoyagi | (meat of a trough shell) |
| 53 | toro_samon | (fatty tuna & salmon) |
| 54 | sanma | (Pacific saury) |
| 55 | hamo | (pike conger) |
| 56 | nasu | (egg plant) |
| 57 | shirauo | (Japanese icefish) |
| 58 | nattou | (fermented bean) |
| 59 | ankimo | (angler liver) |
| 60 | kanpyo_maki | (pickled gourd_maki) |
| 61 | negi_toro_maki | (roll style of no.37) |
| 62 | gyusashi | (raw beef) |
| 63 | hamaguri | (clam) |
| 64 | basashi | (raw horse meat) |
| 65 | fugu | (blowfish) |
| 66 | tsubugai | (TSUBU_shell) |
| 67 | ana_kyu_maki | (sea eel & cucumber roll) |
| 68 | hiragai | (=tairagi; pen shell) |
| 69 | okura | (gumbo) |
| 70 | ume_maki | (pickled plum roll) |
| 71 | sarada_maki | (salad roll) |
| 72 | mentaiko_maki | (chili cod roe roll) |
| 73 | buri | (yellowtail) |
| 74 | shiso_maki | (perilla leaf roll) |
| 75 | ika_nattou | (squid & fermented bean) |
| 76 | zuke | (tuna pickled in soy sauce) |
| 77 | himo | (part of clam) |
| 78 | kaiware | (DAIKON radish sprouts) |
| 79 | kurumaebi | (prawn) |
| 80 | mekabu | (part of tangle) |
| 81 | kue | (kind of cabrilla) |
| 82 | sawara | (Japanese Spanish mackerel) |
| 83 | sasami | (kind of raw chicken) |
| 84 | kujira | (whale) |
| 85 | kamo | (wild duck) |
| 86 | himo_kyu_maki | (part of clam & cucumber roll) |
| 87 | tobiuo | (flying fish) |
| 88 | ishigakidai | (ishigaki sea bream) |
| 89 | mamakari | (Japanese scaled sardine) |
| 90 | hoya | (ascidian) |
| 91 | battera | (OSHIZUSHI style mackerel) |
| 92 | kyabia | (caviar) |
| 93 | karasumi | (dried mullet roe) |
| 94 | uni_kurage | (sea urchin & jellyfish) |
| 95 | karei | (flounder) |
| 96 | hiramasa | (something like amberjack) |
| 97 | namako | (sea cucumber) |
| 98 | shishamo | (smelt) |
| 99 | kaki | (oyster) |
Since there are a lot of sushi types, it is highly probable that no single individual has tried them all. A recommendation system could therefore be used to suggest other sushi types an individual might like. Businesses could leverage this insight to increase sales, perhaps by offering sushi that is not yet available in their area but might prove to be a hit.
The dataset used comes from data curated by Toshihiro Kamishima, called the Sushi Preference Dataset. He describes it as follows:
The SUSHI Preference Data Set includes responses of a questionnaire survey of preference in SUSHI. These preference are collected by a scoring method using a five-point-scale, and additionally by a ranking method. A ranking method is a one of method for performing a sensory test. In this method, the respondents sort given objects according to their preference order. This data set also includes demographic data of respondents and features of SUSHI.
Taken from kamishima.net.
Stated in Fig. 3 are the sushis to be used in this dataset. Individuals in the survey simply chose and ranked at most 5 sushis.
In order to address the problem statement and arrive at the desired results, the data has to undergo a specific preprocessing and analytical pipeline leading to the creation of the recommendation system. The pipeline is as follows:
Each of these stages will have a section of its own and will be discussed further from here on.
The sushi preference data to be used is aggregated and stored in a folder placed in the Jojie Public Library. The folder contains the following:
README-ja.txt README (Japanese)
README-en.txt README (English)
README-stat-ja.txt summary statistics of this data set (Japanese)
sushi3.idata features of items (=SUSHIs)
sushi3.udata features of users
sushi3b.5000.10.score preference score of SUSHIs
Two datasets will be used and preprocessed: sushi3.idata and sushi3b.5000.10.score. The former will be used for some exploratory data analysis steps, and the latter will be the main dataset in the creation of the recommendation system.
Prior to preprocessing these datasets, the names of the sushis in the dataset will be placed in a dictionary. Its use will become evident later on.
sushi_names = {
0:'ebi', 1:'anago', 2:'maguro', 3:'ika', 4:'uni', 5:'tako', 6:'ikura',
7:'tamago', 8:'toro', 9:'amaebi', 10:'hotategai', 11:'tai', 13:'hamachi',
15:'samon', 16:'kazunoko', 17:'shako', 18:'saba', 19:'chu_toro',
20:'hirame', 21:'aji', 22:'kani', 23:'kohada', 24:'torigai', 25:'unagi',
26:'tekka_maki', 27:'kanpachi', 28:'mirugai', 29:'kappa_maki', 30:'geso',
31:'katsuo', 32:'iwashi', 33:'hokkigai', 34:'shimaaji', 35:'kanimiso',
36:'engawa', 37:'negi_toro', 38:'nattou_maki', 39:'sayori',
40:'takuwan_maki', 41:'botanebi', 42:'tobiko', 43:'inari', 44:'mentaiko',
45:'sarada', 46:'suzuki', 47:'tarabagani', 48:'ume_shiso_maki',
49:'komochi_konbu', 50:'tarako', 51:'sazae', 52:'aoyagi', 53:'toro_samon',
54:'sanma', 55:'hamo', 56:'nasu', 57:'shirauo', 58:'nattou', 59:'ankimo',
60:'kanpyo_maki', 61:'negi_toro_maki', 62:'gyusashi', 63:'hamaguri',
64:'basashi', 65:'fugu', 66:'tsubugai', 67:'ana_kyu_maki', 68:'hiragai',
69:'okura', 70:'ume_maki', 71:'sarada_maki', 72:'mentaiko_maki',
73:'buri', 74:'shiso_maki', 75:'ika_nattou', 76:'zuke', 77:'himo',
78:'kaiware', 79:'kurumaebi', 80:'mekabu', 81:'kue', 82:'sawara',
83:'sasami', 84:'kujira', 85:'kamo', 86:'himo_kyu_maki', 88:'ishigakidai',
89:'mamakari', 90:'hoya', 91:'battera', 92:'kyabia', 93:'karasumi',
94:'uni_kurage', 95:'karei', 96:'hiramasa', 97:'namako', 98:'shishamo',
99:'kaki', 14: 'awabi', 87: 'tobiuo', 12: 'akagai'
}
The characteristics of the sushis will be extracted from the dataset and saved in a dataframe. For reference, the following characteristics will be the focus (note that each row corresponds to a specific sushi [refer to Fig. 3] and each column represents one of the features presented below):
1. style 0:maki 1:otherwise (see Wikipedia)
2. major group 0:seafood 1:otherwise
0 corresponds to the minor group nos 0--8.
3. minor group
0:aomono (blue-skinned fish)
1:akami (red meat fish)
2:shiromi (white-meat fish)
3:tare (something like baste; for eel or sea eel)
4:clam or shell
5:squid or octopus
6:shrimp or crab
7:roe
8:other seafood
9:egg
10:meat other than fish
11:vegetables
4. the heaviness/oiliness in taste, range[0-4] 0:heavy/oily
5. how frequently the user eats the SUSHI, range[0-3] 3:frequently eat
6. normalized price
7. how frequently the SUSHI is sold in sushi shop, range[0-1] 1:the most frequently
As loaded, the characteristics are not suitable for viewing in tabular form. Thus, preprocessing steps are performed: renaming the columns to the specific characteristics, transforming the values of the minor characteristic into dummy format, and converting the whole dataframe to float type.
Lastly, the row indexes, which consist of the integers 0-99, are replaced with their corresponding sushi names.
These steps are done in the code below.
def get_sushis():
    '''Get characteristics of the sushis in the dataset'''
    filepath = '/mnt/data/public/sushipref/sushi3-2016/sushi3.idata'
    # The file has no header row, so read it as plain data (otherwise the
    # first sushi's row gets absorbed into the column names).
    df = pd.read_csv(filepath, sep='\t', header=None)
    df.columns = ['item_id', 'name', 'style', 'major', 'minor', 'oiliness',
                  'eating_frequency', 'price', 'selling_frequency']
    # One-hot encode the minor (ingredient) group.
    df['minor'] = df['minor'].astype(int)
    df = pd.get_dummies(df, columns=['minor'])
    df = df[['style', 'major'] + [f'minor_{i}' for i in range(12)] +
            ['oiliness', 'eating_frequency', 'price', 'selling_frequency']]
    minors = {'minor_0': 'blue-skinned fish',
              'minor_1': 'red meat fish',
              'minor_2': 'white-meat fish',
              'minor_3': 'eel',
              'minor_4': 'clam or shell',
              'minor_5': 'squid or octopus',
              'minor_6': 'shrimp or crab',
              'minor_7': 'roe',
              'minor_8': 'other seafood',
              'minor_9': 'egg',
              'minor_10': 'meat other than fish',
              'minor_11': 'vegetables'}
    # Replace the 0-99 row indexes with the sushi names.
    df = df.rename(index=sushi_names, columns=minors)
    return df.astype(float)
sushi_characteristics = get_sushis()
sushi_characteristics.head()
|   | style | major | blue-skinned fish | red meat fish | white-meat fish | eel | clam or shell | squid or octopus | shrimp or crab | roe | other seafood | egg | meat other than fish | vegetables | oiliness | eating_frequency | price | selling_frequency |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ebi | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.728978 | 2.138422 | 1.838420 | 0.84 |
| anago | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.926384 | 1.990228 | 1.992459 | 0.88 |
| maguro | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.769559 | 2.348506 | 1.874725 | 0.88 |
| ika | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.688401 | 2.043240 | 1.515152 | 0.92 |
| uni | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.813043 | 1.643478 | 3.287282 | 0.88 |
In this step, the main dataset, the sushi preference dataset, is extracted from the said folder in Jojie. Since the dataset is not in CSV format, it appears very unstructured when loaded naively via pd.read_csv. With the proper sep and with header set to None, the data becomes viewable as a dataframe.
As initially loaded, some data points have the value -1, indicating that the specific sushi was not rated. These need to be changed to np.nan for the recommendation system we will use later. At this point, the sushi names are also mapped onto the existing column names, which consist only of numbers, for easier referencing.
Note that each row index here corresponds to a user.
def sushi_df():
    '''Return the user-item matrix of sushi ratings'''
    df = pd.read_csv('/mnt/data/public/sushipref/sushi3-2016/'
                     'sushi3b.5000.10.score',
                     sep=' ',
                     header=None)
    df = df.rename(columns=sushi_names)
    # -1 marks an unrated sushi; convert it to NaN.
    return df.replace(-1, np.nan)
sushi_main = sushi_df()
sushi_main.head()
| ebi | anago | maguro | ika | uni | tako | ikura | tamago | toro | amaebi | ... | hoya | battera | kyabia | karasumi | uni_kurage | karei | hiramasa | namako | shishamo | kaki | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | NaN | 0.0 | NaN | 4.0 | 2.0 | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | NaN | 1.0 | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | NaN | 3.0 | 4.0 | NaN | NaN | NaN | 3.0 | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | 4.0 | NaN | NaN | 3.0 | 4.0 | 1.0 | NaN | NaN | 4.0 | 3.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | NaN | NaN | NaN | NaN | 1.0 | NaN | NaN | NaN | NaN | 4.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 |
5 rows × 100 columns
sushi_main.shape
(5000, 100)
Fig. 4 Shape of Dataset
Here we see that there are 5000 users in the dataset.
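For the collaborative-filtering step later on, this wide 5000 × 100 matrix has to be reshaped into (user, item, rating) triples, the long format that Surprise's `Dataset.load_from_df` expects, and its sparsity is worth checking. A minimal sketch, using a small toy frame standing in for `sushi_main`:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the 5000 x 100 user-item matrix above (rows = users,
# columns = sushi names, NaN = unrated).
wide = pd.DataFrame({'ebi': [np.nan, 4.0],
                     'toro': [1.0, np.nan],
                     'uni': [2.0, 0.0]})

# Reshape to long (user, item, rating) triples, dropping unrated entries.
long_df = (wide.rename_axis('user_id')
               .reset_index()
               .melt(id_vars='user_id', var_name='item_id',
                     value_name='rating')
               .dropna())
print(len(long_df))              # 4 observed ratings

# Sparsity: the fraction of user-item pairs that are unrated.
sparsity = wide.isna().sum().sum() / wide.size
print(f'{sparsity:.0%} missing')   # 33% missing in this toy frame
```

Applied to `sushi_main` itself, the same `.melt(...).dropna()` chain yields the triples, and the sparsity comes out very high, since each respondent rated at most 10 sushis out of 100.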
At this step we seek to gather some insights from the data as it is presented.
What does this data tell us about the sushis in existence? As a preliminary step, we plot the univariate distributions of the dataset.
Note: When the data loads, click on Bar Chart
sc = sushi_characteristics.copy()
# Select the 12 one-hot minor-group columns (iloc positions 2-13), melt to
# long format, and keep the rows marking each sushi's group.
sc_chars = sc.iloc[:, 2:14].reset_index()
sc_chars_melted = pd.melt(sc_chars, id_vars=['index'])
sc_chars_melted = sc_chars_melted.loc[sc_chars_melted.value == 1]
plot(sc_chars_melted, 'variable')
[dataprep EDA output: a bar chart of how often each ingredient group appears, with summary statistics of the melted variable column]
Fig. 5 Ingredient Distributions
Most sushis appear to be made with seafood, especially blue-skinned fish. This type of fish is abundant in Japan and is said to be a must-try food.
Let's explore some relationships in the data.
plot(sc, 'price', 'oiliness')
Fig. 6 Oiliness by Price
This graph is somewhat indicative of the notion that the oilier the sushi, the cheaper it is. This could be attributed to the ingredients: harder-to-find or more expensive ingredients such as salmon are drier in mouthfeel and indeed pricier in some restaurants, whereas easier-to-catch fish are often scaly, oilier, and sold cheaply.
plot(sc, 'oiliness', 'eating_frequency')
Fig. 7 Eating Frequency by Oiliness
The relationship between the two is not that strong; however, the distribution of points skews toward the lower right, suggesting that the oilier the sushi, the less likely it is to be eaten.
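To put a number on this visual impression, one could compute the Pearson correlation between the two columns. A sketch with illustrative values standing in for the real `sc` table (the numbers below are hypothetical):

```python
import pandas as pd

# Illustrative excerpt (hypothetical values, not the actual `sc` table)
sc_demo = pd.DataFrame({
    'oiliness':         [0.55, 1.26, 1.77, 2.68, 3.09],
    'eating_frequency': [2.06, 2.08, 2.35, 1.30, 1.72],
})

# Pearson correlation quantifies the downward drift seen in the scatter plot
r = sc_demo['oiliness'].corr(sc_demo['eating_frequency'])
print(round(r, 3))  # negative: oilier sushis tend to be eaten less often
```

On the actual data, `sc['oiliness'].corr(sc['eating_frequency'])` would quantify how strong (or weak) the trend really is.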
def sushi_top10(sushi):
    '''Visualize the top 10 sushis by mean rating'''
    sns.set()
    plt.figure(figsize=(10, 5))
    sushi.mean().sort_values(ascending=False)[:10].plot(kind='bar',
                                                        cmap='summer')
    plt.title('Top 10 Preferred Sushis')
    plt.xticks(rotation=45)
    plt.show()
sushi_top10(sushi_df())
Shown here are the top 10 most preferred sushis in the dataset. To get a better idea of what ingredients the respondents like, the characteristics of these sushis are extracted.
sushi10 = sushi_characteristics.loc[['chu_toro', 'toro', 'negi_toro',
'maguro', 'tarabagani', 'amaebi',
'negi_toro_maki', 'ebi', 'kurumaebi',
'kani']]
sushi10
| style | major | blue-skinned fish | red meat fish | white-meat fish | eel | clam or shell | squid or octopus | shrimp or crab | roe | other seafood | egg | meat other than fish | vegetables | oiliness | eating_frequency | price | selling_frequency | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| chu_toro | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.795193 | 2.034483 | 3.167569 | 0.56 |
| toro | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.551855 | 2.057532 | 4.485455 | 0.80 |
| negi_toro | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.255639 | 2.075188 | 2.472944 | 0.28 |
| maguro | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.769559 | 2.348506 | 1.874725 | 0.88 |
| tarabagani | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.680328 | 1.295082 | 3.415152 | 0.20 |
| amaebi | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.913987 | 2.068328 | 1.924973 | 0.76 |
| negi_toro_maki | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.260684 | 2.000000 | 2.362072 | 0.12 |
| ebi | 1.0 | 0.1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.728978 | 2.138422 | 1.838420 | 0.84 |
| kurumaebi | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.424837 | 1.686275 | 3.200000 | 0.08 |
| kani | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.823733 | 1.474654 | 2.037879 | 0.52 |
It seems that most of the respondents like red meat fish and shrimp or crab based sushis. Maki and maguro sushi have these ingredients and they are indeed popular. Most of these can be seen in the Philippines as well.
Now, the recommendation system is created. For this step, we employ the Surprise library, a Python scikit for building and analyzing recommender systems that deal with explicit rating data.
knn = KNNBaseline(k=10, sim_options={'name': 'pearson',
                                     'user_based': True})
reader = Reader(rating_scale=(0, 4))
df_melt = (sushi_df().reset_index()
                     .melt('index', var_name='itemID',
                           value_name='raw_rating')
                     .dropna())
dataset = Dataset.load_from_df(df_melt, reader)
knn.fit(dataset.build_full_trainset())
predictions = knn.test(knn.trainset.build_testset())
Estimating biases using als... Computing the pearson similarity matrix... Done computing similarity matrix.
The algorithm above prepares our data and in turn produces the rating predictions used for recommendations. Note that each user rated only 10 of the 100 sushis.
We opt to use $k=10$ nearest neighbors due to the vast size of the dataset; it is safe to assume that 10 users similar to the target user are enough to produce predicted recommendations.
# from https://surprise.readthedocs.io/en/stable/FAQ.html
def get_top_n(predictions, n=10):
'''
Get the top n predictions per user
Parameters
----------
predictions : list
list of predicted recommended items for target user
n : integer
default = 10. display `n` recommended items per user
Returns
-------
top_n : list
list of top `n` recommended items per user
'''
# First map the predictions to each user.
top_n = defaultdict(list)
for uid, iid, true_r, est, _ in predictions:
top_n[uid].append((iid, est))
# Then sort the predictions for each user and retrieve the n highest ones.
for uid, user_ratings in top_n.items():
user_ratings.sort(key=lambda x: x[1], reverse=True)
top_n[uid] = user_ratings[:n]
return top_n
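As a sanity check, the function can be exercised with plain 5-tuples, which unpack the same way as Surprise's `Prediction` namedtuples; the values below are hypothetical:

```python
from collections import defaultdict

def get_top_n(predictions, n=10):
    '''Same logic as above: top-n (item, estimate) pairs per user.'''
    top_n = defaultdict(list)
    for uid, iid, true_r, est, _ in predictions:
        top_n[uid].append((iid, est))
    for uid, user_ratings in top_n.items():
        user_ratings.sort(key=lambda x: x[1], reverse=True)
        top_n[uid] = user_ratings[:n]
    return top_n

# Hypothetical predictions: (uid, iid, true_rating, estimate, details)
preds = [(0, 'ebi', 4.0, 3.1, None),
         (0, 'uni', 2.0, 3.9, None),
         (0, 'tako', 1.0, 2.2, None)]
print(get_top_n(preds, n=2)[0])  # -> [('uni', 3.9), ('ebi', 3.1)]
```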
The function above gets the top $n$ items per user, letting us see which sushis should be recommended most strongly to each user beyond those they initially rated.
top_n = get_top_n(predictions, n=10)
surprise_items = [(uid, [iid for (iid, _) in user_ratings])
for uid, user_ratings in top_n.items()]
def take_first(elem):
'''Sorting key that takes the first element of `elem`'''
return elem[0]
surprise_items.sort(key=take_first)
At this point, the recommendation system is finished. The sample output of the system is shown below.
surprise_items[:5]
[(0, ['ika', 'nattou', 'uni', 'kanpyo_maki', 'mentaiko', 'tobiuo', 'anago', 'ana_kyu_maki', 'shiso_maki', 'akagai']), (1, ['kani', 'aji', 'toro', 'katsuo', 'kanpyo_maki', 'torigai', 'zuke', 'ikura', 'shako', 'akagai']), (2, ['maguro', 'samon', 'hamachi', 'anago', 'ikura', 'unagi', 'hamo', 'shiso_maki', 'suzuki', 'nasu']), (3, ['ebi', 'toro', 'uni', 'amaebi', 'ika', 'torigai', 'inari', 'kappa_maki', 'tako', 'takuwan_maki']), (4, ['tarabagani', 'amaebi', 'chu_toro', 'tarako', 'geso', 'uni', 'hamachi', 'katsuo', 'hamo', 'kaki'])]
In the initial creation of the recommendation system, the KNNBaseline() algorithm was used. In this part, a validation check is performed to justify that choice.
It is important to note that Surprise has several algorithms that one can use depending on the structure of the data and the objective.
The code below cross-validates each candidate algorithm and reports its RMSE score; the best performer was KNNBaseline().
def reco_validation():
    '''Compare Surprise algorithms and report their mean RMSE scores'''
    rows = []
    # Iterate over the candidate algorithms
    for algo in [SVD(), SVDpp(), KNNBaseline(), KNNWithMeans(),
                 KNNWithZScore()]:
        # Perform cross-validation with three splits
        results = cross_validate(algo, dataset, measures=['RMSE'],
                                 cv=3, verbose=False)
        # Collect the mean scores, tagged with the algorithm's class name
        temp = pd.DataFrame.from_dict(results).mean(axis=0)
        temp = temp.append(pd.Series([str(algo).split(' ')[0].split('.')[-1]],
                                     index=['Algorithm']))
        rows.append(temp)
    return pd.DataFrame(rows).set_index('Algorithm').sort_values('test_rmse')
reco_validation()
Estimating biases using als... Computing the msd similarity matrix... Done computing similarity matrix. (output repeated for each cross-validation fold of every KNN-based algorithm)
| test_rmse | fit_time | test_time | |
|---|---|---|---|
| Algorithm | |||
| KNNBaseline | 1.143359 | 1.930501 | 7.296964 |
| SVD | 1.163509 | 1.969734 | 0.171080 |
| SVDpp | 1.168507 | 6.334041 | 0.362333 |
| KNNWithMeans | 1.175234 | 1.945160 | 6.977105 |
| KNNWithZScore | 1.183598 | 2.159754 | 7.053932 |
Fig. 11 Statistics of the Different Surprise Algorithms
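For reference, the `test_rmse` column is plain root-mean-square error over held-out ratings. A minimal sketch of the metric (the sample ratings below are hypothetical):

```python
import numpy as np

def rmse(y_true, y_pred):
    '''Root-mean-square error, the metric reported as test_rmse above.'''
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Hypothetical held-out ratings vs. model estimates on the 0-4 scale
print(rmse([4, 0, 3, 2], [3, 1, 3, 4]))  # -> sqrt(6/4) ≈ 1.2247
```

An RMSE around 1.14 on a 0-4 scale thus means the model's estimates are off by a bit more than one rating point on average.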
KNNBaseline, as defined by Surprise, is a basic collaborative filtering algorithm that takes a baseline rating into account. The baseline captures the global mean together with per-user and per-item rating biases, so neighbors' ratings are compared after correcting for how generous each user is and how well-liked each item is overall. Combined with it having the lowest RMSE score among the candidates, this justifies the use of KNNBaseline.
The KNNBaseline prediction is mathematically represented by the equation below:

$$\hat{r}_{ui} = b_{ui} + \frac{\sum_{v \in N_i^k(u)} \text{sim}(u, v) \cdot (r_{vi} - b_{vi})}{\sum_{v \in N_i^k(u)} \text{sim}(u, v)}$$

where $b_{ui} = \mu + b_u + b_i$ is the baseline estimate (global mean plus user and item biases), $\text{sim}(u, v)$ is the similarity between users $u$ and $v$ (Pearson in our configuration), and $N_i^k(u)$ is the set of the $k$ nearest neighbors of $u$ that have rated item $i$.
At this point, recommendations made by the system are investigated. What insights could be seen solely based on the recommendations?
surprise_recos = [i[1] for i in surprise_items]
sushi_recos_top10 = (pd.DataFrame(surprise_recos).stack()
.value_counts().index[:10]).to_list()
sushi_characteristics.loc[sushi_recos_top10]
| style | major | blue-skinned fish | red meat fish | white-meat fish | eel | clam or shell | squid or octopus | shrimp or crab | roe | other seafood | egg | meat other than fish | vegetables | oiliness | eating_frequency | price | selling_frequency | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ebi | 1.0 | 0.1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.728978 | 2.138422 | 1.838420 | 0.84 |
| anago | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.926384 | 1.990228 | 1.992459 | 0.88 |
| ika | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.688401 | 2.043240 | 1.515152 | 0.92 |
| tako | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.088459 | 1.717346 | 1.384330 | 0.76 |
| ikura | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.264873 | 1.979462 | 2.695363 | 0.88 |
| maguro | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.769559 | 2.348506 | 1.874725 | 0.88 |
| uni | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.813043 | 1.643478 | 3.287282 | 0.88 |
| tamago | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 2.368071 | 1.866223 | 1.032468 | 0.84 |
| toro | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.551855 | 2.057532 | 4.485455 | 0.80 |
| hotategai | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.345412 | 1.785659 | 1.772196 | 0.76 |
The table above shows the top 10 recommended sushis across all users. What is fascinating is that most of these recommended sushis share the same makeup and ingredients as the top 10 preferred sushis [refer to Table 3]. Looking at the style column alone, the recommended sushis match the style of the most preferred ones, and most are likewise seafood-based. This implies that the algorithm effectively takes the preferred sushis' ingredients into account when making recommendations. One could theorize that the surveyed population as a whole prefers these ingredients above all others, which is why these types of sushis were recommended.
For this insight, the recommendations are compared with the initial preferences.
Let's take a look at user 1.
user1_pref = sushi_characteristics.loc[surprise_recos[1][:5]]
user1_pref
| style | major | blue-skinned fish | red meat fish | white-meat fish | eel | clam or shell | squid or octopus | shrimp or crab | roe | other seafood | egg | meat other than fish | vegetables | oiliness | eating_frequency | price | selling_frequency | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| kani | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.823733 | 1.474654 | 2.037879 | 0.52 |
| aji | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.323810 | 1.602116 | 2.081313 | 0.56 |
| toro | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.551855 | 2.057532 | 4.485455 | 0.80 |
| katsuo | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.907916 | 1.484653 | 1.464815 | 0.36 |
| kanpyo_maki | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 2.821918 | 1.214612 | 1.000000 | 0.12 |
user1_reco = sushi_characteristics.loc[surprise_recos[1][5:]]
user1_reco
| style | major | blue-skinned fish | red meat fish | white-meat fish | eel | clam or shell | squid or octopus | shrimp or crab | roe | other seafood | egg | meat other than fish | vegetables | oiliness | eating_frequency | price | selling_frequency | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| torigai | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.535077 | 1.110583 | 2.125000 | 0.48 |
| zuke | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.486111 | 1.375000 | 2.125000 | 0.08 |
| ikura | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.264873 | 1.979462 | 2.695363 | 0.88 |
| shako | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.571850 | 0.993110 | 1.856566 | 0.60 |
| akagai | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.516750 | 1.327471 | 3.332449 | 0.72 |
To test this hypothesis, the two groups (preferred and recommended) are compared using their characteristics.
user1_pref.describe().iloc[:, 14]
count    5.000000
mean     2.085846
std      0.939108
min      0.551855
25%      1.907916
50%      2.323810
75%      2.821918
max      2.823733
Name: oiliness, dtype: float64
user1_reco.describe().iloc[:, 14]
count    5.000000
mean     2.074932
std      0.643578
min      1.264873
25%      1.486111
50%      2.516750
75%      2.535077
max      2.571850
Name: oiliness, dtype: float64
The two groups have nearly the same level of oiliness, indicating that the system recommends sushis with characteristics close to the initial preferences. Although some ingredients deviate from the original preferences, the other characteristics remain very similar. Such is the case for torigai in Table 6: its main ingredient, clam or shell, does not appear among the originally preferred sushis, but its other characteristics, like oiliness, price, and eating_frequency, are almost the same, implying that the user may like this sushi as well.
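The same comparison can be extended to several characteristics at once by differencing the group means; a sketch with illustrative numbers standing in for the real tables (the values below are hypothetical):

```python
import pandas as pd

# Hypothetical characteristic values for the two groups (not the real tables)
pref = pd.DataFrame({'oiliness': [0.55, 1.91, 2.32],
                     'price':    [4.49, 1.46, 2.08]})
reco = pd.DataFrame({'oiliness': [1.26, 2.52, 2.57],
                     'price':    [2.70, 1.86, 2.13]})

# Absolute difference of group means; small values support "similar profiles"
diff = (pref.mean() - reco.mean()).abs()
print(diff.round(3))
```

On the actual data, `(user1_pref.mean() - user1_reco.mean()).abs()` over the numeric columns would give the full profile comparison in one step.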
In this project, it was shown that one can build a user-based collaborative filtering recommendation system from user preference ratings of $n$ items. It was further shown that the recommendations were thematically similar to each user's initial preferences, suggesting that the recommendation system is accurate and performs desirably.
The insights of this study could help many foodies or business owners in choosing what sushis to offer for their restaurant. The plethora of sushis make it difficult to choose one over the other, hence this study aims to alleviate that problem.
Some things can be further looked into when recreating this project: hyperparameters could be tuned to lower the RMSE score further (the metric targeted here), and the findings could be augmented with other evaluation metrics where possible. Overall, the hope is that extending this project provides more valuable insights for foodies and business owners deciding which sushi to choose.
Listed below are the references that greatly helped in the creation of this project.